MEPS_IBTS_1

Andrew Edwards

Pre-processing of IBTS data to get desired data for MEPS Table 1

These are the steps for pre-processing the IBTS data to retain just the elements needed for the analyses in the MEPS paper. They should be useful for repeating the analyses with updated data (in which case it would be good to use the icesDatras package to automate downloading the data), and for anyone wanting to extract data and do similar analyses. The steps are not wrapped into functions; it is better to work through them in an Rmarkdown-type format to understand and check what is going on.

library(sizeSpectra)

Data were extracted from the IBTS DATRAS website by Julia Blanchard, who then undertook some initial processing (as detailed in the Supplementary Material of the MEPS paper) and added the species-specific length-weight parameters from Fung et al. (2012). The data are saved within this package as dataOrig (see data-raw/IBTS-data.R).

First, understand the data:

dim(dataOrig)
#> [1] 178435     13
names(dataOrig)
#>  [1] "AphiaID"              "Survey"               "Year"                
#>  [4] "Quarter"              "Area"                 "Species"             
#>  [7] "LngtClas"             "CPUE_number_per_hour" "Taxonomic.group"     
#> [10] "a"                    "b"                    "weight_g"            
#> [13] "CPUE_bio_per_hour"
dataOrig[1:5,1:7]
#> # A tibble: 5 x 7
#>   AphiaID Survey   Year Quarter  Area Species          LngtClas
#>     <int> <fct>   <int>   <int> <int> <fct>               <int>
#> 1  101170 NS-IBTS  2003       1     4 Myxine glutinosa      360
#> 2  101170 NS-IBTS  2008       1     2 Myxine glutinosa        0
#> 3  101170 NS-IBTS  2004       1     5 Myxine glutinosa        0
#> 4  101170 NS-IBTS  2004       1     1 Myxine glutinosa      330
#> 5  101170 NS-IBTS  2013       1     4 Myxine glutinosa      330
dataOrig[1:5,8:13]
#> # A tibble: 5 x 6
#>   CPUE_number_per_h~ Taxonomic.group      a     b weight_g CPUE_bio_per_ho~
#>                <dbl> <fct>            <dbl> <dbl>    <dbl>            <dbl>
#> 1             0.0556 Myxine glutino~ 0.0033  2.70     52.4             2.91
#> 2             0      Myxine glutino~ 0.0033  2.70      0               0   
#> 3             0      Myxine glutino~ 0.0033  2.70      0               0   
#> 4             0.0244 Myxine glutino~ 0.0033  2.70     41.4             1.01
#> 5             0.0909 Myxine glutino~ 0.0033  2.70     41.4             3.76

summary(dataOrig)
#>     AphiaID           Survey            Year         Quarter 
#>  Min.   :101170   NS-IBTS:178435   Min.   :1986   Min.   :1  
#>  1st Qu.:126436                    1st Qu.:1993   1st Qu.:1  
#>  Median :126445                    Median :2001   Median :1  
#>  Mean   :125728                    Mean   :2001   Mean   :1  
#>  3rd Qu.:127140                    3rd Qu.:2008   3rd Qu.:1  
#>  Max.   :274304                    Max.   :2015   Max.   :1  
#>                                                              
#>       Area                           Species          LngtClas     
#>  Min.   :1.000   Gadus morhua            : 25832   Min.   :   0.0  
#>  1st Qu.:2.000   Amblyraja radiata       :  9316   1st Qu.: 120.0  
#>  Median :4.000   Clupea harengus         :  8919   Median : 230.0  
#>  Mean   :3.839   Merlangius merlangus    :  7676   Mean   : 271.3  
#>  3rd Qu.:6.000   Melanogrammus aeglefinus:  6527   3rd Qu.: 370.0  
#>  Max.   :7.000   Pleuronectes platessa   :  6403   Max.   :1500.0  
#>                  (Other)                 :113762                   
#>  CPUE_number_per_hour                 Taxonomic.group         a           
#>  Min.   :   0.000     Gadus morhua            : 25832   Min.   :0.000100  
#>  1st Qu.:   0.040     Amblyraja radiata       :  9316   1st Qu.:0.003500  
#>  Median :   0.121     Clupea harengus         :  8919   Median :0.004200  
#>  Mean   :  11.972     Merlangius merlangus    :  7676   Mean   :0.006864  
#>  3rd Qu.:   0.717     Melanogrammus aeglefinus:  6527   3rd Qu.:0.007100  
#>  Max.   :7207.821     Pleuronectes platessa   :  6403   Max.   :0.235000  
#>                       (Other)                 :113762                     
#>        b            weight_g        CPUE_bio_per_hour  
#>  Min.   :1.797   Min.   :    0.00   Min.   :     0.00  
#>  1st Qu.:3.079   1st Qu.:    8.02   1st Qu.:     1.97  
#>  Median :3.156   Median :   80.87   Median :    27.94  
#>  Mean   :3.146   Mean   :  700.64   Mean   :   469.33  
#>  3rd Qu.:3.243   3rd Qu.:  439.08   3rd Qu.:   171.95  
#>  Max.   :3.527   Max.   :35630.04   Max.   :310586.16  
#> 

Note that LngtClas is in mm, not cm, but a and b are the length-weight coefficients for lengths in cm. We will use cm as the units later.
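As a worked example of why the units matter, the sketch below applies the length-weight relationship (weight = a * length^b, the same form used later to check bodyMass) to the first row of dataOrig shown above. The variable names are illustrative only, and b is taken as displayed (rounded to 2.70), so the result only approximately matches the stored weight_g of 52.4.

```r
# Values from the first row of dataOrig as printed above
lengthMm <- 360      # LngtClas, in mm
a <- 0.0033          # length-weight coefficient (for length in cm)
b <- 2.70            # length-weight exponent, as displayed (rounded)

lengthCm <- lengthMm / 10        # convert mm to cm before applying a and b
weightG  <- a * lengthCm^b       # estimated body mass in g
weightG

# Using the length in mm by mistake inflates the result by a factor of 10^b
a * lengthMm^b / weightG
```

The second expression shows that forgetting the conversion would overestimate body mass by a factor of 10^b, which is roughly 500 for this species.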

Some columns are redundant and we just want to keep the useful ones. AphiaID is a numerical code for each species. Survey and Quarter are the same for all entries, and we don’t need to keep Area, just the number of areas.

numAreas = length(unique(dataOrig$Area))
numAreas
#> [1] 7
colsKeep = c("Year",
             "AphiaID",
             "LngtClas",
             "CPUE_number_per_hour",
             "a",
             "b",
             "weight_g",
             "CPUE_bio_per_hour")
colsDiscard = setdiff(names(dataOrig), colsKeep)
colsDiscard
#> [1] "Survey"          "Quarter"         "Area"            "Species"        
#> [5] "Taxonomic.group"

Note that the object data is modified repeatedly in the following code.

data = sizeSpectra::s_select(dataOrig, colsKeep)   # uses Sebastian Kranz's s_dplyr_funcs.r
data
#> # A tibble: 178,435 x 8
#>     Year AphiaID LngtClas CPUE_number_per~      a     b weight_g
#>    <int>   <int>    <int>            <dbl>  <dbl> <dbl>    <dbl>
#>  1  2003  101170      360           0.0556 0.0033  2.70     52.4
#>  2  2008  101170        0           0      0.0033  2.70      0  
#>  3  2004  101170        0           0      0.0033  2.70      0  
#>  4  2004  101170      330           0.0244 0.0033  2.70     41.4
#>  5  2013  101170      330           0.0909 0.0033  2.70     41.4
#>  6  2013  101170        0           0      0.0033  2.70      0  
#>  7  2003  101170        0           0      0.0033  2.70      0  
#>  8  2012  101170      250           0.04   0.0033  2.70     19.6
#>  9  1999  101170        0           0      0.0033  2.70      0  
#> 10  2003  101170      290           0.0972 0.0033  2.70     29.2
#> # ... with 178,425 more rows, and 1 more variable: CPUE_bio_per_hour <dbl>
# str(data)
summary(data)
#>       Year         AphiaID          LngtClas      CPUE_number_per_hour
#>  Min.   :1986   Min.   :101170   Min.   :   0.0   Min.   :   0.000    
#>  1st Qu.:1993   1st Qu.:126436   1st Qu.: 120.0   1st Qu.:   0.040    
#>  Median :2001   Median :126445   Median : 230.0   Median :   0.121    
#>  Mean   :2001   Mean   :125728   Mean   : 271.3   Mean   :  11.972    
#>  3rd Qu.:2008   3rd Qu.:127140   3rd Qu.: 370.0   3rd Qu.:   0.717    
#>  Max.   :2015   Max.   :274304   Max.   :1500.0   Max.   :7207.821    
#>        a                  b            weight_g        CPUE_bio_per_hour  
#>  Min.   :0.000100   Min.   :1.797   Min.   :    0.00   Min.   :     0.00  
#>  1st Qu.:0.003500   1st Qu.:3.079   1st Qu.:    8.02   1st Qu.:     1.97  
#>  Median :0.004200   Median :3.156   Median :   80.87   Median :    27.94  
#>  Mean   :0.006864   Mean   :3.146   Mean   :  700.64   Mean   :   469.33  
#>  3rd Qu.:0.007100   3rd Qu.:3.243   3rd Qu.:  439.08   3rd Qu.:   171.95  
#>  Max.   :0.235000   Max.   :3.527   Max.   :35630.04   Max.   :310586.16
min(data$CPUE_number_per_hour)
#> [1] 0

So no negative CPUE values or spurious weights. There are a lot of zero CPUE values:

sum(data$CPUE_number_per_hour == 0)
#> [1] 22915

We want to end up with data in a standard format (based on some original analysis I did when writing the code). This requires renaming some of the headings, converting the lengths from mm to cm, and (for helpfulness) ordering by Year, SpecCode and then LngtClass:

  1. Rename the columns:
if(sum( colsKeep != c("Year", "AphiaID", "LngtClas", "CPUE_number_per_hour",
    "a", "b", "weight_g", "CPUE_bio_per_hour")) > 0)
       { stop("Need to adjust renaming") }
names(data) = c("Year", "SpecCode", "LngtClass", "Number", "LWa", "LWb",
         "bodyMass", "CPUE_bio_per_hour")
# CPUE_bio_per_hour is Number * bodyMass
  2. Make cm not mm:
data$LngtClass = data$LngtClass/10
  3. Rearrange the order to be more intuitive:
data = dplyr::arrange(data, Year, SpecCode, LngtClass)
data
#> # A tibble: 178,435 x 8
#>     Year SpecCode LngtClass Number    LWa   LWb bodyMass CPUE_bio_per_hour
#>    <int>    <int>     <dbl>  <dbl>  <dbl> <dbl>    <dbl>             <dbl>
#>  1  1986   105814         0      0 0.0031  3.03        0                 0
#>  2  1986   105814         0      0 0.0031  3.03        0                 0
#>  3  1986   105814         0      0 0.0031  3.03        0                 0
#>  4  1986   105814         0      0 0.0031  3.03        0                 0
#>  5  1986   105814         0      0 0.0031  3.03        0                 0
#>  6  1986   105814         0      0 0.0031  3.03        0                 0
#>  7  1986   105814         0      0 0.0031  3.03        0                 0
#>  8  1986   105814         0      0 0.0031  3.03        0                 0
#>  9  1986   105814         0      0 0.0031  3.03        0                 0
#> 10  1986   105814         0      0 0.0031  3.03        0                 0
#> # ... with 178,425 more rows

That shows that we have (i) a lot of repeated rows that can be amalgamated (presumably repeated because at one point the data included details about trawls, or it is just how the data were obtained), and (ii) lots of rows with Number == 0 that we can discard, though we keep them for now since they will help verify the binning.
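One way to quantify such repeats is to count how many rows share the same Year, SpecCode and LngtClass. The following is a minimal base R sketch on a toy data frame (the values and the names toy, keys and counts are hypothetical, not taken from dataOrig):

```r
# Toy data with one repeated key combination (1986, 105814, 60)
toy <- data.frame(Year      = c(1986, 1986, 1986, 1987),
                  SpecCode  = c(105814, 105814, 105814, 105814),
                  LngtClass = c(60, 60, 61, 60),
                  Number    = c(0.0833, 0.0833, 0.02, 0.05))
keys <- c("Year", "SpecCode", "LngtClass")

# Number of rows whose key combination already appeared in an earlier row
sum(duplicated(toy[, keys]))

# Number of rows for each key combination
counts <- aggregate(Number ~ Year + SpecCode + LngtClass,
                    data = toy, FUN = length)
names(counts)[names(counts) == "Number"] <- "nRows"
counts[counts$nRows > 1, ]
```

Applied to the real data, the same idea would indicate how much the aggregation step is expected to shrink the table.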

  4. Each row represents a combination of Year, SpecCode and LngtClass, but these combinations are not unique. For example, looking at just one species for one year for one length class:
exampleSp = dplyr::filter(data, Year == 1986, SpecCode == 105814, LngtClass == 60)
exampleSp
#> # A tibble: 2 x 8
#>    Year SpecCode LngtClass Number    LWa   LWb bodyMass CPUE_bio_per_hour
#>   <int>    <int>     <dbl>  <dbl>  <dbl> <dbl>    <dbl>             <dbl>
#> 1  1986   105814        60 0.0833 0.0031  3.03     754.              62.8
#> 2  1986   105814        60 0.0833 0.0031  3.03     754.              62.8

So, yes, we have multiple counts of 60 cm fish of this species, which we can simply aggregate. Do this for all years, species and lengths:

data = dplyr::summarise(dplyr::group_by(data,
                                        Year,
                                        SpecCode,
                                        LngtClass),
                        "Number" = sum(Number)/numAreas,
                        "LWa" = unique(LWa),
                        "LWb" = unique(LWb),
                        "bodyMass" = unique(bodyMass))
data
#> # A tibble: 49,191 x 7
#> # Groups:   Year, SpecCode [2,550]
#>     Year SpecCode LngtClass  Number    LWa   LWb bodyMass
#>    <int>    <int>     <dbl>   <dbl>  <dbl> <dbl>    <dbl>
#>  1  1986   105814         0 0       0.0031  3.03     0   
#>  2  1986   105814        10 0.0571  0.0031  3.03     3.31
#>  3  1986   105814        45 0.00714 0.0031  3.03   315.  
#>  4  1986   105814        46 0.00714 0.0031  3.03   337.  
#>  5  1986   105814        50 0.00714 0.0031  3.03   434.  
#>  6  1986   105814        52 0.0293  0.0031  3.03   489.  
#>  7  1986   105814        53 0.0109  0.0031  3.03   518.  
#>  8  1986   105814        54 0.0113  0.0031  3.03   548.  
#>  9  1986   105814        56 0.0218  0.0031  3.03   612.  
#> 10  1986   105814        57 0.0188  0.0031  3.03   646.  
#> # ... with 49,181 more rows
summary(data)
#>       Year         SpecCode        LngtClass          Number         
#>  Min.   :1986   Min.   :101170   Min.   :  0.00   Min.   :   0.0000  
#>  1st Qu.:1993   1st Qu.:126426   1st Qu.: 14.00   1st Qu.:   0.0087  
#>  Median :2001   Median :126461   Median : 27.00   Median :   0.0347  
#>  Mean   :2001   Mean   :124528   Mean   : 32.48   Mean   :   6.2037  
#>  3rd Qu.:2008   3rd Qu.:127140   3rd Qu.: 45.00   3rd Qu.:   0.2604  
#>  Max.   :2015   Max.   :274304   Max.   :150.00   Max.   :1591.4413  
#>       LWa                LWb           bodyMass       
#>  Min.   :0.000100   Min.   :1.797   Min.   :    0.00  
#>  1st Qu.:0.003400   1st Qu.:3.054   1st Qu.:   17.05  
#>  Median :0.004200   Median :3.147   Median :  152.50  
#>  Mean   :0.008107   Mean   :3.129   Mean   :  918.59  
#>  3rd Qu.:0.007800   3rd Qu.:3.243   3rd Qu.:  718.34  
#>  Max.   :0.235000   Max.   :3.527   Max.   :35630.04

dplyr::filter(data, SpecCode == 105814, Year == 1986, LngtClass == 60)
#> # A tibble: 1 x 7
#> # Groups:   Year, SpecCode [1]
#>    Year SpecCode LngtClass Number    LWa   LWb bodyMass
#>   <int>    <int>     <dbl>  <dbl>  <dbl> <dbl>    <dbl>
#> 1  1986   105814        60 0.0238 0.0031  3.03     754.

So Number here correctly equals the sum of the two rows of exampleSp divided by seven (the number of areas).

Number is the average number (of each species and length) caught per hour of trawling across all seven areas.
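The arithmetic behind this averaging can be sketched in base R on a toy data frame. The two 60 cm rows mimic exampleSp; the 61 cm row and the names toy, agg and numAreasToy are made up for illustration. Because the sum is divided by the fixed number of areas rather than by the number of rows, any area with no record for a combination effectively counts as zero.

```r
numAreasToy <- 7    # same value as numAreas above

toy <- data.frame(Year      = rep(1986, 3),
                  SpecCode  = rep(105814, 3),
                  LngtClass = c(60, 60, 61),
                  Number    = c(0.0833, 0.0833, 0.07))

# Sum Number within each Year/SpecCode/LngtClass combination ...
agg <- aggregate(Number ~ Year + SpecCode + LngtClass,
                 data = toy, FUN = sum)
# ... then average over all areas
agg$Number <- agg$Number / numAreasToy
agg
```

For the 60 cm class this gives 2 * 0.0833 / 7, which rounds to the 0.0238 shown above.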

Now confirm the calculations for bodyMass (the body mass of an individual of that LngtClass), since they were done during preprocessing; we should get the same answer using the species-specific length-weight conversions.

data = dplyr::mutate(data,
                     bodyMass2 = LWa * LngtClass^LWb)
if(max(abs(data$bodyMass2 - data$bodyMass)) > 0.0001) stop("Check conversions")
data = dplyr::select(data, -bodyMass2)              # don't keep the confirming column

Now only include body masses of at least 4 g, following Blanchard et al. (2005), since data are unreliable for smaller organisms:

range(data$LngtClass)
#> [1]   0 150
range(data$bodyMass)
#> [1]     0.00 35630.04
sum(data$bodyMass == 0)   # 2549
#> [1] 2549
sum(data$bodyMass < 4 )   # 6893
#> [1] 6893
data = dplyr::filter(data, bodyMass >= 4)
range(data$bodyMass)
#> [1]     4.045665 35630.037377
data
#> # A tibble: 42,298 x 7
#> # Groups:   Year, SpecCode [2,182]
#>     Year SpecCode LngtClass  Number    LWa   LWb bodyMass
#>    <int>    <int>     <dbl>   <dbl>  <dbl> <dbl>    <dbl>
#>  1  1986   105814        45 0.00714 0.0031  3.03     315.
#>  2  1986   105814        46 0.00714 0.0031  3.03     337.
#>  3  1986   105814        50 0.00714 0.0031  3.03     434.
#>  4  1986   105814        52 0.0293  0.0031  3.03     489.
#>  5  1986   105814        53 0.0109  0.0031  3.03     518.
#>  6  1986   105814        54 0.0113  0.0031  3.03     548.
#>  7  1986   105814        56 0.0218  0.0031  3.03     612.
#>  8  1986   105814        57 0.0188  0.0031  3.03     646.
#>  9  1986   105814        58 0.0381  0.0031  3.03     680.
#> 10  1986   105814        59 0.0327  0.0031  3.03     717.
#> # ... with 42,288 more rows
summary(data)
#>       Year         SpecCode        LngtClass          Number         
#>  Min.   :1986   Min.   :101170   Min.   :  4.00   Min.   :   0.0003  
#>  1st Qu.:1993   1st Qu.:126436   1st Qu.: 19.00   1st Qu.:   0.0110  
#>  Median :2001   Median :126450   Median : 31.00   Median :   0.0387  
#>  Mean   :2001   Mean   :124117   Mean   : 36.99   Mean   :   5.1455  
#>  3rd Qu.:2008   3rd Qu.:127140   3rd Qu.: 49.00   3rd Qu.:   0.2657  
#>  Max.   :2015   Max.   :274304   Max.   :150.00   Max.   :1591.4413  
#>       LWa                LWb           bodyMass       
#>  Min.   :0.000100   Min.   :1.797   Min.   :    4.05  
#>  1st Qu.:0.003400   1st Qu.:3.054   1st Qu.:   45.96  
#>  Median :0.004200   Median :3.156   Median :  246.19  
#>  Mean   :0.008035   Mean   :3.131   Mean   : 1068.13  
#>  3rd Qu.:0.007800   3rd Qu.:3.243   3rd Qu.:  904.47  
#>  Max.   :0.235000   Max.   :3.527   Max.   :35630.04
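To see what the 4 g cut-off means in terms of length, the length-weight relationship can be inverted: bodyMass = LWa * LngtClass^LWb gives LngtClass = (bodyMass / LWa)^(1 / LWb). The sketch below does this for SpecCode 105814, taking LWb as 3.029 (the tibble output above rounds it to 3.03); the variable names are illustrative only.

```r
a <- 0.0031       # LWa for SpecCode 105814
b <- 3.029        # LWb for SpecCode 105814
minMass <- 4      # body-mass threshold in g

# Length (cm) at which the estimated body mass equals the threshold
minLength <- (minMass / a)^(1 / b)
minLength

# Estimated body mass (g) for the 10 cm and 11 cm length classes
masses <- a * c(10, 11)^b
masses
```

The threshold length falls between 10 and 11 cm, consistent with the 10 cm class for this species (bodyMass 3.31 g) being dropped by the filter.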

The total of Number (fish per hour of trawling, summed over all years, species and length classes) in this dataset is

sum(data$Number)
#> [1] 217645.3

The unique length classes are:

sort(unique(data$LngtClass))
#>   [1]   4.0   5.0   6.0   7.0   8.0   9.0  10.0  10.5  11.0  11.5  12.0
#>  [12]  12.5  13.0  13.5  14.0  14.5  15.0  15.5  16.0  16.5  17.0  17.5
#>  [23]  18.0  18.5  19.0  19.5  20.0  20.5  21.0  21.5  22.0  22.5  23.0
#>  [34]  23.5  24.0  24.5  25.0  25.5  26.0  26.5  27.0  27.5  28.0  28.5
#>  [45]  29.0  29.5  30.0  30.5  31.0  31.5  32.0  32.5  33.0  33.5  34.0
#>  [56]  34.5  35.0  35.5  36.0  36.5  37.0  38.0  39.0  40.0  41.0  42.0
#>  [67]  43.0  44.0  45.0  46.0  47.0  48.0  49.0  50.0  51.0  52.0  53.0
#>  [78]  54.0  55.0  56.0  57.0  58.0  59.0  60.0  61.0  62.0  63.0  64.0
#>  [89]  65.0  66.0  67.0  68.0  69.0  70.0  71.0  72.0  73.0  74.0  75.0
#> [100]  76.0  77.0  78.0  79.0  80.0  81.0  82.0  83.0  84.0  85.0  86.0
#> [111]  87.0  88.0  89.0  90.0  91.0  92.0  93.0  94.0  95.0  96.0  97.0
#> [122]  98.0  99.0 100.0 101.0 102.0 103.0 104.0 105.0 106.0 107.0 108.0
#> [133] 109.0 110.0 111.0 112.0 113.0 114.0 115.0 116.0 117.0 118.0 119.0
#> [144] 120.0 121.0 122.0 123.0 124.0 125.0 126.0 127.0 128.0 129.0 130.0
#> [155] 131.0 132.0 133.0 134.0 135.0 138.0 139.0 140.0 142.0 144.0 145.0
#> [166] 146.0 149.0 150.0
diff(sort(unique(data$LngtClass)))
#>   [1] 1.0 1.0 1.0 1.0 1.0 1.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
#>  [18] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
#>  [35] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
#>  [52] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
#>  [69] 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
#>  [86] 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
#> [103] 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
#> [120] 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
#> [137] 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
#> [154] 1.0 1.0 1.0 1.0 1.0 3.0 1.0 1.0 2.0 2.0 1.0 1.0 3.0 1.0

The 0.5-cm length classes are only for two species, Atlantic Herring (code 126417) and European Sprat (code 126425), as confirmed here (no differences of 0.5 cm remain once these two species are excluded):

temp = dplyr::filter(data, !(SpecCode %in% c(126417, 126425)))
unique(diff(sort(unique(temp$LngtClass))))
#> [1] 1 3 2
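A complementary check is to compute the smallest bin width for each species directly, rather than excluding species already known to use 0.5-cm bins. This is a minimal base R sketch on toy data; the length values are hypothetical, and the second species code is arbitrary and used only for contrast.

```r
toy <- data.frame(SpecCode  = c(rep(126417, 4), rep(126436, 4)),
                  LngtClass = c(10, 10.5, 11, 12, 20, 21, 23, 24))

# Smallest difference between consecutive unique length classes, per species
minWidth <- sapply(split(toy$LngtClass, toy$SpecCode),
                   function(x) min(diff(sort(unique(x)))))
minWidth

# Species codes with bins narrower than 1 cm
names(minWidth)[minWidth < 1]
```

On the real data, the last line would be expected to return only the herring and sprat codes if the statement above holds.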

Aside: species names and codes are in specCodeNames (though it needs updating, as there are more species in the data than are listed here):

specCodeNames
#> # A tibble: 111 x 2
#>    species             speccode
#>    <fct>                  <int>
#>  1 Agonus cataphractus   127190
#>  2 Alosa alosa           126413
#>  3 Alosa fallax          126415
#>  4 Amblyraja radiata     105865
#>  5 Ammodytes marinus     126751
#>  6 Ammodytidae           125516
#>  7 Anarhichas lupus      126758
#>  8 Argentina silus       126715
#>  9 Argentina sphyraena   126716
#> 10 Arnoglossus           126109
#> # ... with 101 more rows
length(unique(specCodeNames$speccode))   # checking speccode are unique
#> [1] 111

This is needed to remove the grouping left over from the earlier group_by() and summarise() (retained groups can mess up later code):

data = dplyr::ungroup(data)
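The effect of leftover groups can be illustrated on a toy data frame (hypothetical values; assumes the dplyr package is installed). After summarise() on three grouping variables the result is still grouped by the first two, as the "Groups: Year, SpecCode" header in the output above indicates, so later calculations are silently done within groups. summarise() may print a message about the retained grouping.

```r
library(dplyr)

toy <- data.frame(Year      = c(1986, 1986, 1987),
                  SpecCode  = c(1, 1, 2),
                  LngtClass = c(10, 10, 12),
                  Number    = c(1, 2, 3))

grouped <- summarise(group_by(toy, Year, SpecCode, LngtClass),
                     Number = sum(Number))
group_vars(grouped)                        # "Year" "SpecCode"

flat <- ungroup(grouped)
group_vars(flat)                           # character(0)

# The same mutate() gives different answers with and without groups
mutate(grouped, total = sum(Number))$total   # per-group totals: 3 3
mutate(flat,    total = sum(Number))$total   # overall total:    6 6
```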

The next commands (which are not run in this vignette) save IBTS_data in the package; they have already been run once to build the package. For your own data, rename data to something meaningful and save it.

IBTS_data = data
usethis::use_data(IBTS_data, overwrite = TRUE)

So we have the following, where each row is a unique combination of Year, SpecCode and LngtClass (cm, the minimum value of the 1-cm length bin [or 0.5-cm bin for Atlantic Herring and European Sprat]), and Number gives the number of individuals per hour of trawling observed for that combination. Parameters LWa and LWb are the length-weight coefficients for that species from Fung et al. (2012), bodyMass (g) is the resulting estimated body mass for an individual of that species and length class, and Biomass (g per hour) is calculated here as the total biomass caught per hour of trawling for each row.

The first six and last six rows of the resulting Table 1 are:

data_biomass <- dplyr::mutate(data,
                              Biomass = Number * bodyMass)
knitr::kable(rbind(data_biomass[1:6,],
                   data_biomass[(nrow(data_biomass)-5):nrow(data_biomass),
                                     ]),
             digits=c(0, 0, 0, 3, 4, 4, 2, 2))
Year  SpecCode  LngtClass  Number     LWa     LWb  bodyMass  Biomass
1986    105814         45   0.007  0.0031  3.0290    315.46     2.25
1986    105814         46   0.007  0.0031  3.0290    337.17     2.41
1986    105814         50   0.007  0.0031  3.0290    434.05     3.10
1986    105814         52   0.029  0.0031  3.0290    488.81    14.33
1986    105814         53   0.011  0.0031  3.0290    517.84     5.65
1986    105814         54   0.011  0.0031  3.0290    548.00     6.18
2015    154675         34   0.028  0.0244  2.0439     32.93     0.92
2015    274304          8   0.013  0.0080  3.1410      5.49     0.07
2015    274304         14   0.039  0.0080  3.1410     31.85     1.24
2015    274304         15   0.052  0.0080  3.1410     39.55     2.05
2015    274304         16   0.065  0.0080  3.1410     48.44     3.15
2015    274304         17   0.013  0.0080  3.1410     58.60     0.76